This is a course about open data, open science and data science. I hope to learn more about how to use R and other tools for statistical analysis.
Click here to access my GitHub repository
library(dplyr)
lrn14 <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/learning2014.txt", sep=",", header=TRUE)
dim(lrn14)
## [1] 166 7
We are examining a dataset that contains information about students’ attitudes towards learning statistics. The dataset has 7 variables and 166 observations. We know each student’s age and gender and how many points they got in the exam. In addition, the questions measuring a student’s attitude towards learning have been divided into three subcategories: deep, strategic and surface learning. Each of these three variables is the average of a student’s answers to the questions in that category, on a scale from 1 to 5. Click here to learn more about the variables.
Now let us examine the data in more detail. From the graph below we can make the following statements:
We have nearly twice as many female students as male students. Most of the students in our data are between 20 and 30 years old, and male students are on average slightly older than female students.
Male students have on average a slightly better general attitude towards statistics than female students: the average attitude is 3.4 for male students and 3 for female students.
The questions related to the different types of learning show almost no difference between female and male students. The only exception is surface learning, where female students have a clearly smaller variance and a higher mean.
The average number of points in the final exam is 22.7. Male students did slightly better, with an average of 23.5.
library(ggplot2)
library(GGally)
ggpairs(lrn14, mapping = aes(col = gender, alpha = 0.3),
lower = list(combo = wrap("facethist", bins = 20)))
We will build a regression model with exam points as the dependent variable. As explanatory variables we will choose the three variables most highly correlated with points: attitude, stra and surf.
The variable attitude is statistically significant at the 0.1% significance level. The other two variables are not statistically significant on their own. To find out whether a model with fewer variables would be better, we carry out an F-test: we test the null hypothesis that the coefficients of stra and surf are jointly zero against the alternative that at least one of them differs from zero. The p-value of the corresponding F-statistic is 0.18, so we fail to reject the null hypothesis even at the 10% significance level. This suggests that a model with attitude as the only explanatory variable would be more suitable.
library(sandwich)
library(lmtest)
library(car)
model_1 <- lm(points ~ attitude + stra + surf, data = lrn14)
summary(model_1)
##
## Call:
## lm(formula = points ~ attitude + stra + surf, data = lrn14)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## attitude 3.3952 0.5741 5.913 1.93e-08 ***
## stra 0.8531 0.5416 1.575 0.11716
## surf -0.5861 0.8014 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
linearHypothesis(model_1, c("surf = 0", "stra = 0"))
## Linear hypothesis test
##
## Hypothesis:
## surf = 0
## stra = 0
##
## Model 1: restricted model
## Model 2: points ~ attitude + stra + surf
##
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 164 4641.1
## 2 162 4544.4 2 96.743 1.7244 0.1815
model_2 <- lm(points ~ attitude, data = lrn14)
summary(model_2)
##
## Call:
## lm(formula = points ~ attitude, data = lrn14)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
Our fitted model is: points = 11.6 + 3.5*attitude
The coefficient of attitude is 3.5, which means that a one-point increase in the general attitude towards statistics increases expected exam points by 3.5 points.
The intercept is 11.6, which represents a student’s expected exam points in the hypothetical situation where his or her attitude towards statistics is zero.
The multiple R-squared of our model is 0.19, which means that almost 20% of the variance of the dependent variable is explained by the explanatory variable.
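As a quick sanity check, the fitted line can be applied by hand. The helper below is hypothetical and simply plugs an attitude score into the rounded coefficients from the summary:

```r
# Hypothetical helper (not part of the original analysis): the fitted line
# points = 11.6 + 3.5 * attitude, using coefficients rounded from summary(model_2)
predict_points <- function(attitude) 11.6 + 3.5 * attitude

predict_points(3)  # expected exam points for an average attitude of 3 -> 22.1
```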
We can examine the normality assumption with the QQ-plot. Our observations fit the line quite nicely except in the tails. Based on this plot we can assume that the errors are approximately normally distributed.
plot(model_2, which= c(2))
Another assumption of the linear regression model is that the variance of the errors does not depend on the values of the explanatory variables. We can check whether this assumption holds by examining the Residuals vs Fitted plot. Since we cannot find a clear pattern in the distribution of the errors, it seems reasonable to assume that this assumption holds.
plot(model_2, which= c(1))
The Residuals vs Leverage plot allows us to study whether any observation has relatively higher leverage than the others. We can easily see that no single observation stands out.
plot(model_2, which= c(5))
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "nursery" "internet" "guardian" "traveltime"
## [16] "studytime" "failures" "schoolsup" "famsup" "paid"
## [21] "activities" "higher" "romantic" "famrel" "freetime"
## [26] "goout" "Dalc" "Walc" "health" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
The data we are going to analyse concern student alcohol consumption and school achievement. They are combined from two larger datasets about student achievement in two different courses (mathematics and Portuguese language) in secondary education at two Portuguese schools. In total, we have 382 observations and 35 variables. The variables consist of student-specific information such as age and sex, as well as information about each student’s family background and living habits. On top of this, we have the student’s grades in periods 1 to 3 (averages over the two courses). We have also created two variables describing how much alcohol the student uses in total and whether he or she is a high user.
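The two derived variables could be constructed along the following lines. This is a sketch on toy data: the averaging of the workday (Dalc) and weekend (Walc) scales and the cutoff of 2 are assumptions, not confirmed by the text above.

```r
# Sketch on toy data (assumed definitions, not the original code):
# alc_use as the average of workday and weekend consumption,
# high_use as a logical flag for alc_use above an assumed cutoff of 2
alc_toy <- data.frame(Dalc = c(1, 3, 5), Walc = c(2, 5, 5))
alc_toy$alc_use  <- (alc_toy$Dalc + alc_toy$Walc) / 2
alc_toy$high_use <- alc_toy$alc_use > 2
alc_toy
```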
Let us explore the relationships between alcohol consumption and four other interesting variables.
| Variable | Prediction |
|---|---|
| Sex | Male students drink more than female students and are more often high users |
| Failures | The number of failures is bigger if alcohol use is high |
| Absences | The number of absences is bigger if alcohol use is high |
| Romantic | Students who are in a romantic relationship drink less |
We have nearly the same number of female and male students, but, as predicted, there are more high alcohol users among male students. X% of male students are high users compared to X% of female students.
The mean number of absences is 5.3 in the whole population. The number is clearly bigger among students who drink more: high users have nearly two times more absences if they are female and over two times more if they are male. This corresponds nicely with our predictions.
The mean number of past failures is 0.3. If we compare the means between high users and non-high users, we can see that the number is about two times bigger among high users. Again, this is in line with our predictions.
There are more female students than male students in a romantic relationship. Overall, there are more students not in a relationship than in one, and, as predicted, this is the case among high users as well.
## # A tibble: 4 x 5
## # Groups: high_use [?]
## high_use sex count mean_fail mean_abs
## <lgl> <fct> <int> <dbl> <dbl>
## 1 F F 157 0.204 4.97
## 2 F M 113 0.239 3.19
## 3 T F 41 0.439 8.90
## 4 T M 71 0.479 7.41
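The grouped means above could be reproduced roughly as follows. This is a base-R sketch on toy data, since the course data frame `alc` is not reproduced here; the original used a dplyr group_by/summarise chain.

```r
# Base-R sketch of the grouped summary: mean failures and absences
# by high_use and sex, on a small toy data frame standing in for `alc`
toy <- data.frame(
  high_use = c(FALSE, FALSE, TRUE, TRUE, TRUE),
  sex      = c("F", "M", "F", "M", "M"),
  failures = c(0, 1, 1, 2, 0),
  absences = c(2, 4, 8, 10, 6)
)
res <- aggregate(cbind(mean_fail = failures, mean_abs = absences) ~ high_use + sex,
                 data = toy, FUN = mean)
res
```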
Let us examine the summary of our fitted model:
##
## Call:
## glm(formula = high_use ~ failures + absences + sex + romantic,
## family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6507 -0.8256 -0.6131 1.0339 2.1292
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.8769 0.2381 -7.882 3.23e-15 ***
## failures 0.4142 0.1511 2.741 0.00613 **
## absences 0.0752 0.0182 4.132 3.59e-05 ***
## sexM 0.9757 0.2451 3.980 6.88e-05 ***
## romanticyes -0.2806 0.2655 -1.057 0.29063
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 462.21 on 381 degrees of freedom
## Residual deviance: 417.51 on 377 degrees of freedom
## AIC: 427.51
##
## Number of Fisher Scoring iterations: 4
We can see that all but one of our coefficients are statistically different from zero. The coefficients of absences and sex are significant even at the 0.1% significance level. The coefficient of failures is significant at the 1% level, while the coefficient of romantic relationship is not significant even at the 10% level. This means that we could leave this variable out of our model.
If we apply the exponential function to our model parameters, we can interpret them as odds ratios. The 95% confidence interval gives the range within which the true value of the odds ratio lies with 95% confidence.
## odds_ratio 2.5 % 97.5 %
## (Intercept) 0.15 0.09 0.24
## failures 1.51 1.12 2.04
## absences 1.08 1.04 1.12
## sexM 2.65 1.65 4.32
## romanticyes 0.76 0.44 1.26
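Values like those above come from exponentiating the coefficients and their confidence limits. Below is a self-contained sketch on simulated toy data, since the real model uses the `alc` data frame; `confint.default` gives Wald-type intervals.

```r
# Sketch on simulated toy data: exponentiating logistic-regression
# coefficients yields odds ratios, and exponentiating the confidence
# limits yields their intervals (the real analysis used the `alc` data)
set.seed(1)
toy <- data.frame(
  high_use = rbinom(200, 1, 0.3),
  failures = rpois(200, 0.3),
  absences = rpois(200, 5),
  sex      = factor(sample(c("F", "M"), 200, replace = TRUE))
)
m  <- glm(high_use ~ failures + absences + sex, data = toy, family = "binomial")
OR <- exp(cbind(odds_ratio = coef(m), confint.default(m)))  # Wald intervals
round(OR, 2)
```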
The intercept represents the baseline odds of being a high user for female students who are not in a romantic relationship and have no absences or past failures. As it is smaller than one, individuals in this group are less likely to be high users than not.
Each additional past failure increases the odds of being a high user: a student with 2 past failures has 1.51 times the odds of being a high user compared to a student with just one failure.
Each additional absence increases the odds of being a high user: a student with 2 absences has 1.08 times the odds of being a high user compared to a student with just one absence.
Male students have 2.65 times the odds of being high users of alcohol compared to female students.
Students in a romantic relationship are less likely to be high users of alcohol. It is important to note, however, that the confidence interval of this variable includes the number one, so we cannot know for sure what kind of impact a romantic relationship actually has.
If we compare these results to our earlier predictions, we can see that they correspond to each other very nicely. The only deviation is that the model suggests that the effect of a romantic relationship might be ambiguous.
The model we are going to explore is the following. We dropped the variable romantic from our model as it was not statistically significant.
high_use ~ failures + absences + sex
##
## Call:
## glm(formula = high_use ~ failures + absences + sex, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6629 -0.8545 -0.5894 1.0339 2.0428
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.95397 0.22819 -8.563 < 2e-16 ***
## failures 0.40462 0.15024 2.693 0.00708 **
## absences 0.07294 0.01796 4.061 4.88e-05 ***
## sexM 0.98848 0.24453 4.042 5.29e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 462.21 on 381 degrees of freedom
## Residual deviance: 418.64 on 378 degrees of freedom
## AIC: 426.64
##
## Number of Fisher Scoring iterations: 4
Now we can test the predictive power of our model. As we can see from the table below, our model misclassified 12 non-high users as high users and 86 high users as non-high users. The model correctly identified a high user 26 times. The same results are displayed in the plot below, labelled “Actual values and the predictions”.
## prediction
## high_use FALSE TRUE
## FALSE 258 12
## TRUE 86 26
## [1] 0.2565445
Now we can compute the proportion of wrong predictions by defining a loss function. In this case the number is 0.26, which is relatively low.
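A loss function of this kind can be sketched as follows: a prediction counts as wrong when the predicted probability falls on the wrong side of 0.5.

```r
# Sketch of an average-misclassification loss function:
# the share of observations whose predicted probability is more
# than 0.5 away from the true class (i.e. on the wrong side of 0.5)
loss_func <- function(class, prob) mean(abs(class - prob) > 0.5)

loss_func(class = c(1, 0, 1, 0), prob = c(0.9, 0.2, 0.3, 0.6))  # -> 0.5
```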
We know that nearly one third of the students are high users of alcohol. Now we can compare the performance of our model to a simple strategy that classifies every third person as a high user. The prediction error of this strategy is nearly two times bigger, so at least our model is better at predicting high users than simple guessing.
## [1] 0.4319372
## [1] 0.2643979
Now let us perform a 10-fold cross-validation on our model. The model has a test set performance equal to the DataCamp exercise, as the models have identical predictors. The prediction error is 0.2643979
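A 10-fold cross-validation like this can be sketched with `boot::cv.glm`; here on simulated toy data, since the course data frame is not included.

```r
# Sketch: 10-fold cross-validation with boot::cv.glm on simulated toy data
# (the real analysis cross-validated the high_use model on the alc data)
library(boot)
set.seed(1)
toy  <- data.frame(y = rbinom(100, 1, 0.4), x = rnorm(100))
m    <- glm(y ~ x, data = toy, family = "binomial")
loss <- function(class, prob) mean(abs(class - prob) > 0.5)
cv   <- cv.glm(data = toy, cost = loss, glmfit = m, K = 10)
cv$delta[1]  # average prediction error across the folds
```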
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
This week we are going to explore crime rates in different parts of Boston. Our dataset has 506 observations and 14 variables. In addition to the crime rate, we know various attributes of the towns. You can find more information about the variables here
| Variable | Description |
|---|---|
| crim | per capita crime rate by town. |
| zn | proportion of residential land zoned for lots over 25,000 sq.ft. |
| indus | proportion of non-retail business acres per town. |
| chas | Charles River dummy variable (= 1 if tract bounds river; 0 otherwise). |
| nox | nitrogen oxides concentration (parts per 10 million). |
| rm | average number of rooms per dwelling. |
| age | proportion of owner-occupied units built prior to 1940. |
| dis | weighted mean of distances to five Boston employment centres. |
| rad | index of accessibility to radial highways. |
| tax | full-value property-tax rate per $10,000. |
| ptratio | pupil-teacher ratio by town. |
| black | 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town. |
| lstat | lower status of the population (percent). |
| medv | median value of owner-occupied homes in $1000s. |
Before starting the analysis, let’s have a look at our data. Below are two plot matrices. From the first one, we can see the distributions of the variables in our dataset. The second one illustrates the correlations between the variables: the bigger the circle, the stronger the correlation, and the colour of the circle tells us whether the correlation is negative (red) or positive (blue).
After standardizing the dataset we can see that the mean of every variable is equal to zero. In addition, the variances of the variables are all equal to one.
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
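The standardization step itself can be sketched with base R’s `scale()`, which subtracts each column mean and divides by the column standard deviation:

```r
# scale() sketch: after standardization every column has mean 0 and sd 1
x  <- matrix(c(1, 2, 3, 10, 20, 30), ncol = 2)
xs <- scale(x)

colMeans(xs)       # both columns ~0
apply(xs, 2, sd)   # both columns 1
```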
## crim zn indus chas nox rm age dis rad tax
## crim 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58
## zn -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31
## indus 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72
## chas -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04
## nox 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67
## rm -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29
## age 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51
## dis -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53
## rad 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91
## tax 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00
## ptratio 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46
## black -0.39 0.18 -0.36 0.05 -0.38 0.13 -0.27 0.29 -0.44 -0.44
## lstat 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54
## medv -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47
## ptratio black lstat medv
## crim 0.29 -0.39 0.46 -0.39
## zn -0.39 0.18 -0.41 0.36
## indus 0.38 -0.36 0.60 -0.48
## chas -0.12 0.05 -0.05 0.18
## nox 0.19 -0.38 0.59 -0.43
## rm -0.36 0.13 -0.61 0.70
## age 0.26 -0.27 0.60 -0.38
## dis -0.23 0.29 -0.50 0.25
## rad 0.46 -0.44 0.49 -0.38
## tax 0.46 -0.44 0.54 -0.47
## ptratio 1.00 -0.18 0.37 -0.51
## black -0.18 1.00 -0.37 0.33
## lstat 0.37 -0.37 1.00 -0.74
## medv -0.51 0.33 -0.74 1.00
Now we will create a new categorical variable for the crime rate and divide the dataset into two parts: one for training and one for testing.
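Those two steps might look like this. This is a sketch on simulated data; the quartile-based breaks and the 80/20 split are assumptions about how the categorization and split were done.

```r
# Sketch (assumed approach): quartile-based crime classes via cut(),
# then a random 80/20 train/test split
set.seed(1)
crim  <- rexp(100)                               # stands in for the crim column
crime <- cut(crim, breaks = quantile(crim),
             labels = c("low", "med_low", "med_high", "high"),
             include.lowest = TRUE)
ind   <- sample(length(crime), size = 0.8 * length(crime))  # training rows
train <- crime[ind]
test  <- crime[-ind]
```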
Next, we will fit a linear discriminant analysis using our training set. Our dependent variable is the crime rate and the rest of the variables are used as predictors. The plot below illustrates the results. We can see that the high crime rate class is clearly separated from the rest, and it is the variable rad that drives this separation. From low to medium-high crime rates, the variables zn and nox seem to be the biggest determinants.
Now, we will examine the predictive power of this analysis. If our model had categorised every observation correctly, the table below would have zeros everywhere except on the diagonal. We can see that this is not the case, even though most of the observations are correctly predicted. In addition, we can see that our model categorised the high crime rates better than the low ones.
## predicted
## correct low med_low med_high high
## low 17 11 0 0
## med_low 5 8 3 0
## med_high 0 12 19 1
## high 0 0 0 26
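The fit-and-tabulate steps above can be sketched with `MASS::lda`; iris stands in here for the Boston training data, since the split used above is not reproduced.

```r
# Sketch of LDA fitting and cross-tabulation with MASS::lda,
# using iris as a stand-in for the Boston training set
library(MASS)
fit  <- lda(Species ~ ., data = iris)
pred <- predict(fit, newdata = iris)$class
table(correct = iris$Species, predicted = pred)
```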
Lastly, we will run the k-means algorithm on our dataset. We start by reloading the data, scaling it and calculating the distances between the observations.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2662 8.4832 12.6090 13.5488 17.7568 48.8618
Now, let’s run the k-means algorithm with 4 clusters. We can decide the optimal number of clusters from the plot below: the optimal number is found at the point where the total within-cluster sum of squares (WCSS) drops radically. In this case, that point is at two clusters.
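The WCSS curve behind this choice can be computed like so; this is a sketch with iris standing in for the scaled Boston data.

```r
# Sketch: total within-cluster sum of squares (WCSS) for 1 to 10 clusters,
# with iris standing in for the scaled Boston data; nstart stabilizes
# the random initialization
set.seed(123)
x <- scale(as.matrix(iris[, 1:4]))
twcss <- sapply(1:10, function(k) kmeans(x, centers = k, nstart = 10)$tot.withinss)
twcss  # look for the k after which the curve stops dropping sharply
```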
Let’s run the algorithm again and plot it. The k-means with two clusters works quite nicely: in most variables we can see two distinct distributions.
Interactive plot:
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404 13
dim(lda.fit$scaling)
## [1] 13 3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3,
type= 'scatter3d', mode='markers', color = train$crime)
## Another plot
km2 <- kmeans(boston_scaled[as.numeric(row.names(train)),], centers = 2)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3,
type= 'scatter3d', mode='markers', color = km2$cluster)
This week we are going to explore the human development index (HDI). Our dataset consists of 155 countries and 8 variables. Here is the list of our variables:
| Variable | Description |
|---|---|
| Edu.r | Ratio of females and males with at least secondary education |
| Lab.r | Ratio of females and males in the labour force |
| Life.exp | Life expectancy at birth |
| Edu.exp | Expected years of schooling |
| GNI.pc | Gross National Income per capita |
| Mat.mort | Maternal mortality ratio |
| Ad.birth | Adolescent birth rate |
| Rep | Percentage of female representatives in parliament |
We will begin our analysis by examining our data graphically. We have two plot matrices below. From the first one, we can see the distributions of the variables in our dataset. The second one illustrates the correlations between the variables: the bigger the circle, the stronger the correlation, and the colour of the circle tells us whether the correlation is negative (red) or positive (blue).
From these two plots we can make, for example, the following statements:
First, we will perform principal component analysis (PCA) on the unstandardized human data. If we examine the variability captured by each principal component, shown below, we can see that the first component captures all of it. And if we take a look at our biplot, the result does not seem desirable.
| x | |
|---|---|
| PC1 | 100 |
| PC2 | 0 |
| PC3 | 0 |
| PC4 | 0 |
| PC5 | 0 |
| PC6 | 0 |
| PC7 | 0 |
| PC8 | 0 |
FIGURE I: The first PCA dimension seems to capture all of the variation
Next, let’s see how the results change if we standardize the data before performing the analysis. The change is remarkable: now some of the variance is captured by the other components as well, and the biplot is more readable. This change is due to the fact that one of the variables in the data (GNI.pc) has a much higher variance than the others, which dominates the unstandardized analysis.
| x | |
|---|---|
| PC1 | 53.6 |
| PC2 | 16.2 |
| PC3 | 9.6 |
| PC4 | 7.6 |
| PC5 | 5.5 |
| PC6 | 3.6 |
| PC7 | 2.6 |
| PC8 | 1.3 |
FIGURE II: The first PCA dimension has variables stretching to both directions. Maternal mortality and teenage pregnancies contribute to the opposite direction than others such as life expectancy. The number of female representatives in the parliament contributes to the second PCA dimension along with labor market ratio.
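The contrast between the two runs can be sketched with `prcomp`, whose `scale.` argument controls standardization; iris stands in here for the human data.

```r
# Sketch: PCA without and with standardization (iris as a stand-in);
# scale. = TRUE standardizes the columns before computing the rotation
pca_raw <- prcomp(iris[, 1:4])                 # unstandardized
pca_std <- prcomp(iris[, 1:4], scale. = TRUE)  # standardized first

# share of variance captured by each principal component
round(100 * summary(pca_std)$importance["Proportion of Variance", ], 1)
```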
Now, we will analyse the results of our second PCA. If we take a closer look at the biplot of our second analysis, we can see that the variables Rep and Lab.r contribute to the second dimension. The variables Mat.mort and Ad.birth contribute to the first dimension, but in the opposite direction to the rest of the variables. These results can also be seen in the summary of our analysis below: the second principal component has two clearly higher loadings, and on the first component Mat.mort and Ad.birth are clearly positive while the others are negative or close to zero.
| PC1 | PC2 | |
|---|---|---|
| Edu.r | -0.36 | 0.04 |
| Lab.r | 0.05 | 0.72 |
| Life.exp | -0.44 | -0.03 |
| Edu.exp | -0.43 | 0.14 |
| GNI.pc | -0.35 | 0.05 |
| Mat.mort | 0.44 | 0.15 |
| Ad.birth | 0.41 | 0.08 |
| Rep | -0.08 | 0.65 |
Now, let us examine a new dataset, tea, from the package FactoMineR. The data come from a questionnaire about tea consumption, with 300 observations and 36 variables. The variables include questions about how tea is consumed and perceived, as well as a few individual attributes. We are not going to analyse the entire dataset; instead we have selected the 9 most interesting variables.
## sex how sugar How
## F:178 tea bag :170 No.sugar:155 alone:195
## M:122 tea bag+unpackaged: 94 sugar :145 lemon: 33
## unpackaged : 36 milk : 63
## other: 9
## escape.exoticism diuretic breakfast
## escape-exoticism :142 diuretic :174 breakfast :144
## Not.escape-exoticism:158 Not.diuretic:126 Not.breakfast:156
##
##
## lunch sophisticated
## lunch : 44 Not.sophisticated: 85
## Not.lunch:256 sophisticated :215
##
##
## 'data.frame': 300 obs. of 9 variables:
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
Next, let’s perform the MCA on our data. The summary of our analysis is seen below. The eigenvalues tell us how much of the variance is retained by each dimension: we can see that half of the variance is retained by the first 5 dimensions. We can also note that none of the variables is highly associated with the first three dimensions, as none of the values in the categorical variables section is close to one.
##
## Call:
## MCA(X = tea_time, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6
## Variance 0.163 0.154 0.129 0.121 0.118 0.116
## % of var. 12.214 11.560 9.647 9.050 8.885 8.694
## Cumulative % of var. 12.214 23.774 33.422 42.471 51.357 60.051
## Dim.7 Dim.8 Dim.9 Dim.10 Dim.11 Dim.12
## Variance 0.106 0.100 0.092 0.087 0.075 0.072
## % of var. 7.958 7.516 6.866 6.536 5.658 5.415
## Cumulative % of var. 68.009 75.525 82.391 88.927 94.585 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | 0.724 1.072 0.477 | -0.266 0.153 0.064 | 0.050
## 2 | 0.282 0.162 0.061 | 0.676 0.990 0.350 | -0.580
## 3 | -0.232 0.110 0.059 | 0.149 0.048 0.025 | -0.088
## 4 | 0.212 0.092 0.052 | -0.739 1.182 0.630 | -0.230
## 5 | 0.269 0.149 0.070 | 0.201 0.087 0.039 | -0.392
## 6 | 0.221 0.100 0.046 | -0.319 0.220 0.095 | 0.074
## 7 | 0.411 0.346 0.156 | -0.026 0.001 0.001 | 0.034
## 8 | -0.111 0.025 0.011 | 0.018 0.001 0.000 | -0.196
## 9 | 0.267 0.146 0.052 | 0.295 0.188 0.064 | 0.167
## 10 | 0.146 0.044 0.018 | 0.426 0.393 0.155 | 0.275
## ctr cos2
## 1 0.007 0.002 |
## 2 0.873 0.258 |
## 3 0.020 0.009 |
## 4 0.137 0.061 |
## 5 0.399 0.149 |
## 6 0.014 0.005 |
## 7 0.003 0.001 |
## 8 0.099 0.035 |
## 9 0.072 0.020 |
## 10 0.195 0.064 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr
## F | -0.427 7.368 0.266 -8.911 | 0.286 3.497
## M | 0.622 10.750 0.266 8.911 | -0.417 5.102
## tea bag | 0.233 2.095 0.071 4.603 | -0.098 0.390
## tea bag+unpackaged | -0.135 0.387 0.008 -1.573 | 0.550 6.835
## unpackaged | -0.748 4.577 0.076 -4.774 | -0.975 8.225
## No.sugar | -0.549 10.634 0.322 -9.819 | 0.409 6.239
## sugar | 0.587 11.368 0.322 9.819 | -0.438 6.669
## alone | -0.304 4.101 0.172 -7.166 | -0.235 2.597
## lemon | 0.447 1.497 0.025 2.715 | -0.275 0.598
## milk | 0.789 8.913 0.165 7.032 | 0.741 8.319
## cos2 v.test Dim.3 ctr cos2 v.test
## F 0.119 5.972 | 0.035 0.061 0.002 0.722 |
## M 0.119 -5.972 | -0.050 0.089 0.002 -0.722 |
## tea bag 0.012 -1.932 | -0.554 15.015 0.401 -10.952 |
## tea bag+unpackaged 0.138 6.426 | 0.829 18.612 0.314 9.686 |
## unpackaged 0.130 -6.226 | 0.450 2.100 0.028 2.874 |
## No.sugar 0.179 7.317 | -0.025 0.028 0.001 -0.448 |
## sugar 0.179 -7.317 | 0.027 0.030 0.001 0.448 |
## alone 0.103 -5.548 | 0.091 0.470 0.016 2.156 |
## lemon 0.009 -1.670 | 1.037 10.213 0.133 6.303 |
## milk 0.146 6.609 | -0.600 6.540 0.096 -5.353 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## sex | 0.266 0.119 0.002 |
## how | 0.103 0.214 0.414 |
## sugar | 0.322 0.179 0.001 |
## How | 0.222 0.185 0.274 |
## escape.exoticism | 0.002 0.005 0.148 |
## diuretic | 0.086 0.220 0.090 |
## breakfast | 0.119 0.267 0.004 |
## lunch | 0.028 0.148 0.212 |
## sophisticated | 0.316 0.049 0.014 |
Finally, we will plot both the variables and the individuals. From the individual plot we can say that there aren’t really any observations that stand out. From the variable plot, on the other hand, we can see that, for example, unpackaged differs from the rest.